1 One-Way Likelihood Ratio or $\chi^2$ test

Suppose we have a set of data $x$ and two hypotheses $H_R$ and $H_S$. We wish to know which hypothesis explains the data better. To do this, we compute the (log) likelihood ratio

$$L = \log\left(\frac{P(x \mid H_R)}{P(x \mid H_S)}\right).$$

Assuming the data are i.i.d. given each hypothesis, we have $P(x \mid H_J) = \prod_i P(x_i \mid H_J)$, where $J \in \{R, S\}$, and thus the likelihood ratio is

$$L = \sum_i \log\left(\frac{P(x_i \mid H_R)}{P(x_i \mid H_S)}\right). \qquad (1)$$

The Bayesian formulation of the problem could be approached by parameterising $H_R$ and $H_S$ with some unknown parameters, $\theta_R$ and $\theta_S$, respectively. The likelihood of each hypothesis is then given by integrating over all possible parameter values:

$$L = \log\left(\frac{\int P(\theta_R \mid H_R)\, P(x \mid \theta_R, H_R)\, d\theta_R}{\int P(\theta_S \mid H_S)\, P(x \mid \theta_S, H_S)\, d\theta_S}\right). \qquad (2)$$

These integrations can sometimes be performed analytically, or using some numerical integration techniques. However, we will focus instead on a simple heuristic method which is related to the $\chi^2$ statistics discussed above. Note that David MacKay [3] explicitly assumes the parameters have an 'intrinsic' arity to them (multinomials with an intrinsic number of bins). This assumption may not always be correct, and may in fact lead to incorrect conclusions.

Now suppose that the hypotheses are multinomial probability distributions $H_R = \{r_1, \ldots, r_N\}$, with the constraint that $\sum_i r_i = 1$, and each $r_i$ corresponds to some range (bin) of the data $x$ (and similarly we have $s_i$ for $H_S$). Then the likelihood ratio can be written as a sum over the $N$ bins by grouping terms in Equation (1) into the bins:

$$L = \sum_{i \in \text{bins}} n_i \log\left(\frac{r_i}{s_i}\right),$$

where $n_i$ is the number of data points that fall into bin $i$.

Alternatively, we can compute the $\chi^2$ statistic of the data under each hypothesis and compare them, choosing the one with the smaller $\chi^2$.

David MacKay argues effectively for the use of the likelihood ratio [3]. We will see in more detail the conditions in which the chi-squared test is not applicable in Section 3.

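As a rough illustration of the binned one-way likelihood ratio, the following Python sketch sums $n_i \log(r_i / s_i)$ over bins. The function name, the example counts and the two multinomials are all hypothetical, chosen only for illustration; the sketch also assumes that both hypotheses assign non-zero probability to every occupied bin.

```python
import math


def one_way_log_likelihood_ratio(counts, r, s):
    """Binned log likelihood ratio of H_R against H_S: sum_i n_i * log(r_i / s_i)."""
    if not (len(counts) == len(r) == len(s)):
        raise ValueError("counts, r and s must have one entry per bin")
    total = 0.0
    for n_i, r_i, s_i in zip(counts, r, s):
        if n_i == 0:
            continue  # empty bins contribute nothing to the sum
        # Assumption of this sketch: r_i > 0 and s_i > 0 in every occupied bin.
        total += n_i * (math.log(r_i) - math.log(s_i))
    return total


counts = [8, 1, 1]            # hypothetical data, already grouped into 3 bins
h_r = [0.8, 0.1, 0.1]         # hypothetical multinomial for H_R
h_s = [1 / 3, 1 / 3, 1 / 3]   # hypothetical multinomial for H_S

L = one_way_log_likelihood_ratio(counts, h_r, h_s)
print(L)  # positive here, so H_R explains these counts better than H_S
```

Swapping the two hypotheses flips the sign of the result, and comparing a hypothesis with itself gives zero.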
2 Two-Way Likelihood Ratio Test



If we wish to compare two sets of data, $x_R$ and $x_S$, and ask whether they are drawn from the same distribution or from two different distributions, then our first hypothesis is that there are two models $H_R$ and $H_S$ to explain the data, and the second hypothesis is that there is a single model $H_{R+S}$ that explains the data. Thus, the question can be formulated as the likelihood ratio

$$L = \log\left(\frac{P(x_R, x_S \mid H_R, H_S)}{P(x_R, x_S \mid H_{R+S})}\right) = \log\left(\frac{P(x_R \mid H_R)}{P(x_R \mid H_{R+S})}\right) + \log\left(\frac{P(x_S \mid H_S)}{P(x_S \mid H_{R+S})}\right) \qquad (3)$$

where we have made the assumption that $x_R$ is independent of $H_S$ (and vice-versa) if the two distributions are different, and that $x_R$ is independent of $x_S$ given $H_{R+S}$ if the two distributions are the same, both of which are true given the i.i.d. assumption of data given hypotheses.

The Bayesian formulation of the problem is to parameterise $H_R$, $H_S$ and $H_{R+S}$ with some unknown parameters, $\theta_R$, $\theta_S$ and $\theta_{R+S}$, respectively. The likelihoods in (3) are then given by integrating over all possible parameter values:

$$L = \log\left(\frac{\int\!\!\int P(\theta_R, \theta_S \mid H_R, H_S)\, P(x_R, x_S \mid \theta_R, \theta_S, H_R, H_S)\, d\theta_R\, d\theta_S}{\int P(\theta_{R+S} \mid H_{R+S})\, P(x_R, x_S \mid \theta_{R+S}, H_{R+S})\, d\theta_{R+S}}\right) \qquad (4)$$

These integrations can sometimes be performed analytically, or using some numerical integration techniques. However, in this note, we will use the most likely estimate for the parameters, given the data. This simple method is related to the $\chi^2$ statistics discussed above, but we will see some limitations of it in Section 4.

We can estimate the parameters of $H_R$ directly from the data, as the most likely estimate using a multinomial with values $r_i = R_i / R$, with $R_i$ being the number of data points in $x_R$ that fall into bin $i$, and $R = \sum_i R_i$. Similarly, $H_S$ is a multinomial with $s_i = S_i / S$, and $S = \sum_i S_i$. Finally, we can estimate $H_{R+S}$ in the same way given both datasets, to give a multinomial with values $(R_i + S_i)/(R + S)$. Using the same transformation (from data to bins) as above, the likelihood ratio becomes

$$L = \sum_{i \in \text{bins}} R_i \log\left(\frac{R_i / R}{(R_i + S_i)/(R + S)}\right) + \sum_{i \in \text{bins}} S_i \log\left(\frac{S_i / S}{(R_i + S_i)/(R + S)}\right) \qquad (5)$$

which is simply the weighted sum of the Kullback-Leibler divergences of the two datasets from the average distribution

$$L = R \cdot D_{KL}(r_i \,\|\, p_i) + S \cdot D_{KL}(s_i \,\|\, p_i)$$

where $p_i = (R_i + S_i)/(R + S)$ is the probability of a data point falling in bin $i$ estimated from both sets of data. It is also a symmetrised relative entropy measure comparing the data to its own distribution (e.g. $R_i$ to $R_i / R$) and to the average distribution of both sets of data, $(R_i + S_i)/(R + S)$. We can see this better by expanding out the logs of fractions as differences of logs and cancelling terms to obtain

$$L = \sum_i \left[ R_i \log\left(\frac{R_i}{R}\right) + S_i \log\left(\frac{S_i}{S}\right) - (R_i + S_i) \log\left(\frac{R_i + S_i}{R + S}\right) \right]$$

or

$$L = R \sum_i r_i \log(r_i) + S \sum_i s_i \log(s_i) - (R + S) \sum_i p_i \log(p_i).$$

The first term is the negative entropy of the distribution $r_i$ (scaled by the number of datapoints), the second is the scaled negative entropy of $s_i$, and the third is the scaled entropy of the joint (pooled) distribution $p_i$. Denoting $\gamma_r$, $\gamma_s$, $\gamma_p$ as the entropy of $r_i$, $s_i$ and $p_i$, respectively, we have

$$L = -\left[ R\,\gamma_r + S\,\gamma_s - (R + S)\,\gamma_p \right] \qquad (6)$$

$$\phantom{L} = -(R + S) \left[ \frac{R}{R+S}\,\gamma_r + \frac{S}{R+S}\,\gamma_s - \gamma_p \right] \qquad (7)$$

where each entropy is built from the function $\gamma(x) = -x \log(x)$ summed over bins, e.g. $\gamma_r = \sum_i \gamma(r_i)$. Equation (6) can be understood by noting that if the two distributions $H_R$ and $H_S$ are the same, then averaging them will make no difference to the entropy of the distributions. If, on the other hand, $H_R$ and $H_S$ are different, then the average of the two will have higher entropy. Thus, $\gamma_p$ will be larger if the distributions are different, making $L$ also larger (due to the negative sign), which is what we expect from the original definition of the likelihood ratio for the two-way problem as given in (3).
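The equivalence between the direct form of Equation (5) and the entropy form of Equation (6) can be checked numerically. The sketch below is illustrative only: the function names and the example counts are hypothetical, natural logarithms are assumed, and empty bins are skipped using the convention $0 \log 0 = 0$.

```python
import math


def two_way_log_likelihood_ratio(R_counts, S_counts):
    """Direct form: sum_i R_i log(r_i / p_i) + S_i log(s_i / p_i)."""
    R, S = sum(R_counts), sum(S_counts)
    L = 0.0
    for R_i, S_i in zip(R_counts, S_counts):
        p_i = (R_i + S_i) / (R + S)  # pooled estimate for bin i
        if R_i > 0:
            L += R_i * math.log((R_i / R) / p_i)
        if S_i > 0:
            L += S_i * math.log((S_i / S) / p_i)
    return L


def entropy(probs):
    """Entropy -sum_i p_i log(p_i), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def two_way_via_entropies(R_counts, S_counts):
    """Entropy form: -[R*gamma_r + S*gamma_s - (R + S)*gamma_p]."""
    R, S = sum(R_counts), sum(S_counts)
    r = [c / R for c in R_counts]
    s = [c / S for c in S_counts]
    p = [(a + b) / (R + S) for a, b in zip(R_counts, S_counts)]
    return -(R * entropy(r) + S * entropy(s) - (R + S) * entropy(p))


R_counts = [12, 5, 3]   # hypothetical bin counts for dataset R
S_counts = [4, 6, 10]   # hypothetical bin counts for dataset S

direct = two_way_log_likelihood_ratio(R_counts, S_counts)
via_entropy = two_way_via_entropies(R_counts, S_counts)
print(direct, via_entropy)  # the two forms agree up to rounding error
```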

More precisely, the weighted average of the entropies of any two probability distributions is never greater than the entropy of their weighted average. To show this, note that $\gamma(x) = -x \log(x)$ is a concave function, meaning every point on every chord lies on or below the function [1], so that

$$\alpha\,\gamma(r) + \beta\,\gamma(s) \le \gamma(\alpha r + \beta s)$$

where $\alpha + \beta = 1$ with $\alpha, \beta \ge 0$, and equality is achieved when $r = s$. Applying this to each bin and summing over bins, the same holds for whole distributions:

$$-\alpha \sum_i r_i \log(r_i) - \beta \sum_i s_i \log(s_i) \le -\sum_i (\alpha r_i + \beta s_i) \log(\alpha r_i + \beta s_i) \qquad (8)$$

If we use $\alpha = R/(R+S)$ and $\beta = S/(R+S)$ then $p_i = \alpha r_i + \beta s_i$, and Equation (8) says that the square bracket in Equation (7) is never positive, so that $L \ge 0$. The extreme cases are

1. $r_i$ and $s_i$ are identical, then $L = 0$.

2. $r_i = 0$ for all $i$ where $s_i > 0$, and $s_i = 0$ for all $i$ where $r_i > 0$. In this case, either $r_i$ or $s_i$ is zero in every bin, and

$$L = -(R + S) \left[ \sum_i \alpha r_i \log(\alpha) + \sum_i \beta s_i \log(\beta) \right] = -(R + S) \left[ \alpha \log(\alpha) + \beta \log(\beta) \right]$$

Since $\alpha + \beta = 1$, this function has a maximum of $(R + S) \log 2$ at $\alpha = 0.5$ (that is, $R + S$ when logarithms are taken to base 2), and a minimum of 0 at $\alpha = 1$ or 0.

Thus, we can see that $0 \le L \le (R + S) \log 2$, with the minimum achieved for identical distributions, and the maximum achieved for maximally different distributions.
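The two extreme cases can be checked numerically. The following sketch is only an illustration: the function name and the counts are hypothetical, and natural logarithms are assumed. It evaluates $L$ for two datasets with identical proportions and for two datasets with non-overlapping bins.

```python
import math


def two_way_L(R_counts, S_counts):
    """Two-way log likelihood ratio from bin counts (natural logs)."""
    R, S = sum(R_counts), sum(S_counts)
    L = 0.0
    for R_i, S_i in zip(R_counts, S_counts):
        p_i = (R_i + S_i) / (R + S)
        if R_i > 0:
            L += R_i * math.log((R_i / R) / p_i)
        if S_i > 0:
            L += S_i * math.log((S_i / S) / p_i)
    return L


# Case 1: same proportions (different sample sizes), so L should be ~0.
same = two_way_L([5, 10, 5], [10, 20, 10])

# Case 2: disjoint bins with R = S = 20, i.e. alpha = beta = 0.5.
disjoint = two_way_L([10, 10, 0, 0], [0, 0, 15, 5])

# Case 2 again with R = 30, S = 10, i.e. alpha = 0.75.
unequal = two_way_L([20, 10, 0, 0], [0, 0, 6, 4])

print(same)                           # ~0
print(disjoint, 40 * math.log(2))     # both equal (R + S) log 2
print(unequal)                        # smaller than the alpha = 0.5 case
```

With natural logarithms the disjoint, equal-size case evaluates to $(R + S) \log 2$, and moving $\alpha$ away from 0.5 lowers the value, as the closed form $-(R+S)[\alpha \log \alpha + \beta \log \beta]$ predicts.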



3 Two-Way $\chi^2$ test



If instead we use the two-way $\chi^2$ test, we compute the expected counts from the average distribution of the two datasets. Since $p_i$ is the average distribution given both sets of data, we have the expected counts in bin $i$ for the two datasets as

$$E_R(i) = R\,p_i = R\,\frac{R_i + S_i}{R + S}, \qquad E_S(i) = S\,p_i = S\,\frac{R_i + S_i}{R + S}. \qquad (9)$$

In many treatments of this problem, particularly in the biological sciences, the $i \in \{1, \ldots, N\}$ are referred to as the rows and the datasets $\{R, S\}$ are referred to as the columns in a contingency table. Typically, the rows are a set of features of the data, and the columns are two different datasets, usually obtained in two different conditions.

To answer the question of whether the two datasets are drawn from the same hypothesis or not, we formulate the null hypothesis, which states that they are, and then figure out the expected counts as above. The chi-squared statistic for the two sets of data is

$$\chi^2 = \sum_{J \in \{R, S\}} \sum_{i \in N} \frac{(J_i - E_J(i))^2}{E_J(i)} = \sum_{i \in N} \frac{(R_i - E_R(i))^2}{E_R(i)} + \sum_{i \in N} \frac{(S_i - E_S(i))^2}{E_S(i)}$$

Putting in the definitions of the expected counts from (9) above, and doing some algebra, we get

$$\chi^2 = \sum_i \frac{\left( \sqrt{S/R}\, R_i - \sqrt{R/S}\, S_i \right)^2}{R_i + S_i}$$

exactly equation (14.3.3) in [4].
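The algebra connecting the two forms of the statistic can be checked numerically. In this sketch the function names and the counts are hypothetical; bins that are empty in both datasets are skipped, since they have zero expected count and contribute nothing.

```python
import math


def chi2_from_expected(R_counts, S_counts):
    """Sum of (observed - expected)^2 / expected over both datasets."""
    R, S = sum(R_counts), sum(S_counts)
    chi2 = 0.0
    for R_i, S_i in zip(R_counts, S_counts):
        if R_i + S_i == 0:
            continue  # bin empty in both datasets
        p_i = (R_i + S_i) / (R + S)
        E_R, E_S = R * p_i, S * p_i  # expected counts under the null hypothesis
        chi2 += (R_i - E_R) ** 2 / E_R + (S_i - E_S) ** 2 / E_S
    return chi2


def chi2_closed_form(R_counts, S_counts):
    """Simplified form: sum_i (sqrt(S/R) R_i - sqrt(R/S) S_i)^2 / (R_i + S_i)."""
    R, S = sum(R_counts), sum(S_counts)
    a, b = math.sqrt(S / R), math.sqrt(R / S)
    return sum(
        (a * R_i - b * S_i) ** 2 / (R_i + S_i)
        for R_i, S_i in zip(R_counts, S_counts)
        if R_i + S_i > 0
    )


R_counts = [12, 5, 3, 0]   # hypothetical counts, unequal totals on purpose
S_counts = [4, 6, 10, 10]

chi2_a = chi2_from_expected(R_counts, S_counts)
chi2_b = chi2_closed_form(R_counts, S_counts)
print(chi2_a, chi2_b)  # the two forms agree up to rounding error
```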

This value of $\chi^2$, if large, tells us that the null hypothesis can be rejected, and thus that the distributions are likely to be different. To know what "large" means, we can use a chi-squared probability test, which gives us the probability that the sum of the squares of $\nu$ random normal variables of unit variance and zero mean will be greater than $\chi^2$ [4]. Another way to say this is that it is the probability that a value of $\chi^2$ at least this large would have occurred by chance if the null hypothesis were correct. The chi-squared probability test is therefore simply the integral of the probability density of the $\chi^2$ distribution:

$$P(\chi^2 \mid \nu) = \int_{\chi^2}^{\infty} \frac{t^{\nu/2 - 1}\, e^{-t/2}}{2^{\nu/2}\, \Gamma(\nu/2)}\, dt$$

The number of degrees of freedom in the hypotheses is $\nu$. If the two datasets are drawn without regard for each other (no constraints on the number of datapoints drawn), then the number of degrees of freedom, $\nu$, is the number of bins in which at least one of the datasets has at least one count. Typically, if $P(\chi^2 \mid \nu) < 0.05$ (the "p-value"), the chi-squared test is deemed significant, and the null hypothesis can be safely rejected. A simple test that can be used is to reject the null hypothesis if $\chi^2 > \nu$ [4] (p. 661).
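The upper-tail probability $P(\chi^2 \mid \nu)$ equals the regularised upper incomplete gamma function $Q(\nu/2, \chi^2/2)$. The sketch below evaluates it with the Python standard library only, using the usual power series for small arguments and a continued fraction otherwise. The function name, tolerance and iteration cap are assumptions of this sketch; in practice a statistics library routine would normally be preferred.

```python
import math


def chi2_upper_tail(chi2, nu, eps=1e-14, max_iter=10000):
    """Probability that a chi-squared variable with nu degrees of freedom
    exceeds chi2, i.e. Q(a, x) with a = nu / 2 and x = chi2 / 2."""
    if nu <= 0 or chi2 < 0:
        raise ValueError("need nu > 0 and chi2 >= 0")
    a, x = nu / 2.0, chi2 / 2.0
    if x == 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Power series for the lower tail P(a, x); return Q = 1 - P.
        term = 1.0 / a
        total = term
        ap = a
        for _ in range(max_iter):
            ap += 1.0
            term *= x / ap
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_prefactor)


chi2_value, nu = 7.86, 3          # hypothetical statistic and degrees of freedom
p_value = chi2_upper_tail(chi2_value, nu)
print(p_value, p_value < 0.05)    # reject the null hypothesis when below 0.05
```

Two closed forms make convenient sanity checks: for $\nu = 2$ the tail is $e^{-\chi^2/2}$, and for $\nu = 1$ it is $\mathrm{erfc}(\sqrt{\chi^2/2})$.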



4 One- and Two-Way G-test 

Interestingly, the likelihood ratio can be more formally related to the $\chi^2$ test, by considering the G-test, defined as [5]

$$G = 2 \sum_i O_i \log(O_i / E_i)$$

where $O_i$ are the observed counts and $E_i$ are the expected counts. Note that this is simply the Kullback-Leibler divergence between observed and expected counts, multiplied by a factor of two. When summed over all data points in our two-column example, this is

$$G = 2 \sum_i R_i \log\left(\frac{R_i}{E_R(i)}\right) + 2 \sum_i S_i \log\left(\frac{S_i}{E_S(i)}\right) \qquad (10)$$

Putting in the expressions for the expected counts from above (9), we obtain exactly $G = 2L$, with $L$ given by Equation (5) above. In general, with smaller amounts of data, the chi-squared test will sometimes give incorrect answers, whereas the G-test will not, and so is the recommended test [3] [5]. To see in more detail why this is so, we can write $O_i = E_i + \delta_i$, with $\sum_i \delta_i = 0$ so that the total number of counts stays the same. The G-test is then

$$G = 2 \sum_i (E_i + \delta_i) \log\left(1 + \frac{\delta_i}{E_i}\right).$$
If we Taylor expand this around $\delta_i = 0$ (the point at which $O_i$ and $E_i$ agree), and using $\log(1 + x) = x - \frac{x^2}{2} + O(x^3)$, we get

$$G \approx 2 \sum_i \left( \delta_i + \frac{\delta_i^2}{2 E_i} \right) = \sum_i \frac{\delta_i^2}{E_i} = \sum_i \frac{(O_i - E_i)^2}{E_i} = \chi^2,$$

where the linear term vanishes because $\sum_i \delta_i = 0$, and so we see that $G \approx \chi^2$ when $O_i$ is close to $E_i$. However, the more $O_i$ and $E_i$ differ, the less well this approximation will work, and $\chi^2$ will tend to give erroneous answers. The effects of a single outlier in a small sample set will be more pronounced, which explains why the $\chi^2$ test often fails in situations with little data. This is the same reason why a linear regression can fail with little data, due to the strong effects of outliers.
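The quality of the approximation can be seen on a small example. The sketch below uses hypothetical counts and function names; it compares $G$ and $\chi^2$ when the observed counts are close to the expected counts and when they are far from them.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum_i O_i * log(O_i / E_i), skipping empty observed bins."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def chi2_statistic(observed, expected):
    """chi^2 = sum_i (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


expected = [25.0, 25.0, 25.0, 25.0]   # hypothetical expected counts
near = [26, 24, 25, 25]               # small deviations, same total
far = [45, 5, 25, 25]                 # large deviations, same total

g_near, chi2_near = g_statistic(near, expected), chi2_statistic(near, expected)
g_far, chi2_far = g_statistic(far, expected), chi2_statistic(far, expected)

print(g_near, chi2_near)  # nearly identical
print(g_far, chi2_far)    # clearly different
```

In both cases the deviations sum to zero, matching the constraint used in the expansion.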

Since the $\chi^2$ value is just an approximation to the G-value, the G-value can also be used in the chi-squared probability test. This method is recommended by most texts on statistics for the biological sciences. However, it is unclear why one would want to do this, and what the validity is, since the chi-squared test is based on the pdf of $\chi^2$. The G-test directly gives (twice) the log likelihood ratio of one hypothesis vs. the other, and so a significance can be attributed directly. However, recall that these tests are both based on models or hypotheses whose parameters are derived from the data itself. Instead of computing Equation (4) directly, as we should do, we are taking the most likely estimate of the parameters $\theta_R$, $\theta_S$ and $\theta_{R+S}$ (those derived directly from the data), and collapsing the integrals to these point estimates. One implication of this is that the G-values will depend on the complexity of our models (e.g. the number of bins in our multinomials/histograms). This is simply the model overfitting the data: the models derived from each data set $R$ and $S$ will, with enough complexity, perfectly fit the data. Therefore, to interpret the G-value from Equation (10), we must take the complexity of the model into account. To evaluate significance, the value of the likelihood ratio ($G/2$) should be compared to the number of degrees of freedom, $\nu$. If $G > 2\nu$, then the null hypothesis can be safely rejected. This corresponds roughly to $p < 0.05$.
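The decision rule can be sketched in a few lines. The function names and counts below are hypothetical, and the sketch takes $\nu$ to be the number of bins that are occupied in at least one of the two datasets, which is one reading of the definition given for the chi-squared test.

```python
import math


def two_way_g(R_counts, S_counts):
    """G-value for two datasets, with expected counts from the pooled data."""
    R, S = sum(R_counts), sum(S_counts)
    g = 0.0
    for R_i, S_i in zip(R_counts, S_counts):
        pooled = R_i + S_i
        if pooled == 0:
            continue
        E_R = R * pooled / (R + S)
        E_S = S * pooled / (R + S)
        if R_i > 0:
            g += 2.0 * R_i * math.log(R_i / E_R)
        if S_i > 0:
            g += 2.0 * S_i * math.log(S_i / E_S)
    return g


def degrees_of_freedom(R_counts, S_counts):
    """Assumed here: number of bins occupied in at least one dataset."""
    return sum(1 for R_i, S_i in zip(R_counts, S_counts) if R_i + S_i > 0)


def reject_null(R_counts, S_counts):
    """Heuristic rule: reject 'same distribution' when G > 2 * nu."""
    g = two_way_g(R_counts, S_counts)
    nu = degrees_of_freedom(R_counts, S_counts)
    return g > 2 * nu, g, nu


different = reject_null([30, 10, 5, 5], [5, 10, 15, 20])   # hypothetical counts
similar = reject_null([10, 10, 10, 10], [11, 9, 10, 10])   # hypothetical counts
print(different)  # rejected: G is well above 2 * nu
print(similar)    # not rejected: G is far below 2 * nu
```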



5 Likelihood ratio tests for dynamic models 

In the previous sections, we assumed the data were i.i.d., and that the models (hypotheses) were simple multinomials. It is also possible that the data are sequentially dependent, such as when they come from a dynamic model. For example, if the data arise from a hidden Markov model, then the same considerations apply as above. For any type of model $H_J$, $J \in \{R, S, R+S\}$, trained on the data in $J$, we can compute each of $P(x_R \mid H_R)$, $P(x_S \mid H_S)$, $P(x_R \mid H_{R+S})$ and $P(x_S \mid H_{R+S})$, and then use Equation (3) to compute the likelihood ratio, and use a chi-squared probability test as usual. If the $H$ are hidden Markov models, then the likelihoods will be computed using the standard forward equations [2].
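For discrete-output hidden Markov models, the procedure might look like the sketch below. It implements the scaled forward recursion to obtain $\log P(x \mid H)$ and then combines the four log likelihoods as in Equation (3). Everything named here is an assumption for illustration: the model parameters are hand-set stand-ins for trained models, and training itself is not shown.

```python
import math


def hmm_log_likelihood(obs, start, trans, emit):
    """log P(obs | model) by the forward recursion with per-step scaling.

    start[k]    : probability of starting in state k
    trans[j][k] : probability of moving from state j to state k
    emit[k][o]  : probability of emitting symbol o from state k
    """
    if not obs:
        raise ValueError("obs must be a non-empty sequence")
    n = len(start)
    alpha = [start[k] * emit[k][obs[0]] for k in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                emit[k][obs[t]] * sum(alpha[j] * trans[j][k] for j in range(n))
                for k in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # the sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def two_way_ratio(x_r, x_s, model_r, model_s, model_rs):
    """Equation (3): log P(x_R|H_R)/P(x_R|H_{R+S}) + log P(x_S|H_S)/P(x_S|H_{R+S})."""
    return (
        hmm_log_likelihood(x_r, *model_r) - hmm_log_likelihood(x_r, *model_rs)
        + hmm_log_likelihood(x_s, *model_s) - hmm_log_likelihood(x_s, *model_rs)
    )


# Hypothetical two-state models over a binary alphabet (start, trans, emit).
model_r = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.2, 0.8]])
model_s = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.6, 0.4], [0.4, 0.6]])
model_rs = ([0.5, 0.5], [[0.7, 0.3], [0.3, 0.7]], [[0.75, 0.25], [0.3, 0.7]])

x_r = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]   # hypothetical sequence for dataset R
x_s = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]   # hypothetical sequence for dataset S

L = two_way_ratio(x_r, x_s, model_r, model_s, model_rs)
print(L)
```

Because these parameters are hand-set rather than fitted, the printed value only illustrates the mechanics; with models actually trained on the respective data, the same call gives the ratio described above.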